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We present two classes of improved estimators for mutual information M(X,Y), from samples 
of random points distributed according to some joint probability density /j,(x,y). In contrast to 
conventional estimators based on binnings, they are based on entropy estimates from k- nearest 
neighbour distances. This means that they are data efficient (with k = 1 we resolve structures down 
to the smallest possible scales), adaptive (the resolution is higher where data are more numerous), 
and have minimal bias. Indeed, the bias of the underlying entropy estimates is mainly due to non- 
uniformity of the density at the smallest resolved scale, giving typically systematic errors which 
scale as functions of k/N for N points. Numerically, we find that both families become exact for 
independent distributions, i.e. the estimator M(X, Y) vanishes (up to statistical fluctuations) if 
/j,(x,y) = fi(x)fi(y). This holds for all tested marginal distributions and for all dimensions of x and 
y. In addition, we give estimators for redundancies between more than 2 random variables. We 
compare our algorithms in detail with existing algorithms. Finally, we demonstrate the usefulness 
of our estimators for assessing the actual independence of components obtained from independent 
component analysis (ICA), for improving ICA, and for estimating the reliability of blind source 
separation. 



I. INTRODUCTION 

Among the measures of independence between random 
variables, mutual information (MI) is singled out by its 
information theoretic background [lj . In contrast to the 
linear correlation coefficient, it is sensitive also to depen- 
dencies which do not manifest themselves in the covari- 
ance. Indeed, MI is zero if and only if the two random 
variables are strictly independent. The latter is also true 
for quantities based on Renyi entropies @ , and these are 
often easier to estimate (in particular if their order is 2 
or some other integer > 2). Nevertheless, MI is unique in 
its close ties to Shannon entropy and the theoretical ad- 
vantages derived from this. Some well known properties 
of MI and some simple consequences thereof are collected 
in the appendix. 

But it is also true that estimating MI is not always 
easy. Typically, one has a set of N bivariate measure- 
ments, Zi — (xi,yi), i = 1,...N, which are assumed to 
be iid (independent identically distributed) realizations 
of a random variable Z — [X, Y) with density /i(x,y). 
Here, x and y can be either scalars or can be elements 
of some higher dimensional space. In the following we 
shall assume that the density is a proper smooth func- 
tion, although we could also allow more singular densi- 
ties. All we need is that the integrals written below exist 
in some sense. In particular, we will always assume that 
log(0) = 0, i.e. we do not have to assume that densities 
are strictly positive. The marginal densities of X and Y 
are fi x (x) = J dyfj,(x,y) and p, y (y) = J dxfj,(x,y). The 
MI is defined as 



I(X,Y) 



J J dxdy fi(x,y) log 



Vx(x)fJ, y (y) 



(1) 



we always will use natural logarithms. The aim is to es- 
timate I(X, Y) from the set {zi} alone, without knowing 
the densities fi,fi x , and (j, y . 

One of the main fields where MI plays an important 
role, at least conceptually, is independent component 
analysis (ICA) Ll |_|. In the ICA literature, very crude 
approximations to MI based on cumulant expansions are 
popular because of their ease of use. But they are valid 
only for distributions close to Gaussians and can mainly 
be used for ranking different distributions by interdepen- 
dence, much less for estimating the actual dependences. 
Expressions obtained by entropy maximalization using 
averages of some functions of the sample data as con- 
straints Q are more robust, but are still very crude ap- 
proximations. Finally, estimates based on explicit pa- 
rameterizations of the densities might be useful but are 
not very efficient. More promising are methods based on 
kernel density estimators 0, |(J . We will not pursue these 
here either, but we will comment on them in Sec. IV. A. 

The most straightforward and widespread approach for 
estimating MI more precisely consists in partitioning the 
supports of X and Y into bins of finite size, and approx- 
imating Eq.([Q) by the finite sum 



I(X,Y) » J binned (A,r) = Vp(i,j)lo 



/ j i 



' Px(i)Py(j) 



The base of the logarithm determines the units in which 
information is measured. In particular, taking base two 
leads to information measured in bits. In the following, 



where p x (i) = /. dx fi x (x), p y (j) = ^dy n v {y), and 
P{hj) — li Ij dxdy ju,(x,y) - and J i means the inte- 
gral over bin i. An estimator of /binned (X, Y) is ob- 
tained by counting the numbers of points falling into the 
various bins. If n x (i) (n y (j)) is the number of points 
falling into the i-th bin of X (j-th bin of Y), and n(i,j) 
is the number of points in their intersection, then we 
approximate p x (i) ps n x (i)/N, p y (j) w n y (j)/N, and 
Pihj) ~ n (hj)/N. It is easily seen that the r.h.s. of 
Eq.(J2J indeed converges to I(X, Y) if we first let N — > oo 



2 



and then let all bin sizes tend to zero, if all densities exist 
as proper (not necessarily smooth) functions. If not, i.e. 
if the distributions are e.g. (multi-)fractal, this conver- 
gence might no longer be true. In that case, Eq.Q would 
define resolution dependent mutual entropies which di- 
verge in the limit of infinite resolution. Although the 
methods developed below could be adapted to apply also 
to that case, we shall not do this in the present paper. 

The bin sizes used in Eq.Q do not need to be the 
same for all bins. Optimized estimators plla] use indeed 
adaptive bin sizes which are essentially geared at hav- 
ing equal numbers n(i,j) for all pairs (i, j) with non-zero 
measure. While such estimators are much better than 
estimators using fixed bin sizes, they still have system- 
atic errors which result on the one hand from approxi- 
mating I(X, Y) by /binned (X , Y) , and on the other hand 
by approximating (logarithms of) probabilities by (log- 
arithms of) frequency ratios. The latter could be pre- 
sumably minimized by using corrections for finite n x {i) 
resp. n(i,j) |jg. These corrections are in the form of 
asymptotic series which diverge for finite N, but whose 
first 2 terms improve the estimates in typical cases. The 
first correction term - which often is not sufficient - was 
taken into account in [(| 0] . 

In the present paper we will not follow these lines, but 
rather estimate MI from fc-nearest neighbour statistics. 
There exists an extensive literature on such estimators 
for the simple Shannon entropy 

H(X) = - J dx^(x)\ogfj,(x), (3) 

dating back at least to 0, . But it seems that these 
methods have never been used for estimating MI. In 
E3, E3, EH EE E3, E3| it is assumed that x is one- 
dimensional, so that the Xi can be ordered by magnitude 
and Xi+i — Xi — > for N — * oo. In the simplest case, the 
estimator based only on these distances is 

1 N ^ 

H{x) « — — £ kete+i - + ~ *K N ) • ( 4 ) 
i=i 

Here, ip(x) is the digamma function, ip(x) = 
T{x)~ x dT(x)/dx. It satisfies the recursion %j){x + 1) = 
tp(x) + l/x and ip(l) = -C where C = 0.5772156... 
is the Euler-Mascheroni constant. For large x, ip(x) as 
log x — l/2x. Similar formulas exist which use Xi+k — %i 
instead of Xi+\ — Xi, for any integer k < N. 

Although Eq.QJ and its generalizations to k > 1 seem 
to give the best estimators of H(X), they cannot be used 
for MI because it is not obvious how to generalize them 
to higher dimensions. Here we have to use a slightly 
different approach, due to ^3 ( see a ls° |20L l2l| : the lat- 
ter authors were only interested in fractal measures and 
estimating their information dimensions, but the basic 
concepts are the same as in estimating H{X) for smooth 
densities). 

Assume some metrics to be given on the spaces 
spanned by X, Y and Z — (X, Y). We can then rank, 



for each point Zi = (xi,yi), its neighbours by distance 
di.j = \ \zi — Zj\\: dij 1 < dij 2 < dij 3 < — Similar rank- 
ings can be done in the subspaces X and Y. The basic 
idea of [T9L I20I l2l| is to estimate H(X) from the average 
distance to the /c-nearest neighbour, averaged over all Xi. 
Details will be given in Sec. II. Mutual information could 
be obtained by estimating in this way H(X), H(Y) and 
H(X,Y) separately and using [l| 

I(X,Y)=H(X) + H(Y)-H(X,Y) . (5) 

But this would mean that the errors made in the individ- 
ual estimates would presumably not cancel, and therefore 
we proceed differently. 

Indeed we will present two slightly different algorithms, 
both based on the above idea. Both use for the space 
Z = (X, Y) the maximum norm, 

\\z-z'\\=m ax {\\x-x%\\y-y'\\}, (6) 

while any norms can be used for ||ac — x'\\ and ||y — y'\\ 
(they need not be the same, as these spaces could be com- 
pletely different). Let us denote by e(i)/2 the distance 
from Zi to its A:-th neighbour, and by e x (i)/2 and e y (i)/2 
the distances between the same points projected into the 
X and Y subspaces. Obviously, e(i) = ma,x{e x (i), e y (i)}. 

In the first algorithm, we count the number n x (i) of 
points Xj whose distance from Xi is strictly less than 
e(?)/2, and similarly for y instead of x. This is illus- 
trated in Fig. la. Notice that e(i) is a random (fluctuat- 
ing) variable, and therefore also n x (i) and n y (i) fluctuate. 
We denote by (. . .) averages both over all i G [1, . . .N] 
and over all realizations of the random samples, 

JV 

(...)= AT 1 £E[...(i)]- (7) 

i=l 

The estimate for MI is then 

lW(X,Y) = ^k)-(^(n x + l)+TP(n y + l))+iJj(N). (8) 

Alternatively, in the second algorithm, we replace n x (i) 
and n y (i) by the number of points with \\xi — xj\\ < 
e x (i)/2 and — yj\\ < e y (i)/2 (see Figs, lb and lc). 
The estimate for MI is then 

I^(X,Y) =iP(k)-l/k-(ilj(n x )+xlj(n y ))+iP(N). (9) 

The derivations of Eqs.© and © will be given in Sec. II. 
There we will also give formulas for generalized redun- 
dancies in higher dimensions, 

I(X 1 ,X 2 ,...X m ) - H(X 1 ) + H(X 2 ) + ... + H(X m ) 
- H(X 1 ,X 2 ,...X m ). (10) 

In general, both formulas give very similar results. For 
the same k, Eq.lJSJ gives slightly smaller statistical er- 
rors (because n x (i) and n y (i) tend to be larger and have 
smaller relative fluctuations) , but have larger systematic 
errors. The latter is only severe if we are interested in 
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FIG. 1: Panel (a): Determination of e(i), n x (i) and n y (i) 
in the first algorithm, for k = 1 and some fixed i. In this 
example, n x (i) — 5 and n y (i) = 3. 

Panels (b),(c): Determination of e x (i), €y(i), n x (i) and n y (i) 
in the second algorithm. Panel (b) shows a case where e x (i) 
and ty(i) are determined by the same point, while panel (c) 
shows a case where they are determined by different points. 



very high dimensions where e(i) tends typically to be 
much larger than the marginal e Xj (i). In that case the 
second algorithm seems preferable. Otherwise, both can 
be used equally well. 

A systematic study of the performance of Eqs.JSJ 
and @ and comparison with previous algorithms will 
be given in Sec. III. Here we will just show results of 
/( 2 ) (X, Y) for Gaussian distributions. Let X and Y be 
Gaussians with zero mean and unit variance, and with 
covariance r. In this case I(X,Y) is known exactly 

/Gau SS (X,r) = -il0g(l-r 2 ). (11) 

In Fig. 2 we show the errors I^ 2 \X, Y)—Iq BuUSS (X : Y) for 
various values of r, obtained from a large number (typi- 
cally 10 5 — 10 7 ) of realizations. We show only results for 
k = 1, plotted against 1/N. Results for k > 1 are sim- 
ilar. To a first approximation I^(X,Y) and I^{X,Y) 
depend only on the ratio k/N. 

The most conspicuous feature seen in Fig. 2, apart from 
the fact that indeed I^(X,Y) - I Gmss {X,Y) -> for 
N — > oo, is that the systematic error is compatible with 
zero for r = 0, i.e. when the two Gaussians are uncor- 
rclatcd. We checked this with high statistics runs for 
many different values of k and N (a priori one should ex- 
pect that systematic errors become large for very small 
N), and for many more distributions (exponential, uni- 
form, ...). In all cases we found that both I^(X,Y) 
and (X, Y) become exact for independent variables. 
Moreover, the same seems to be true for higher order 




0.01 0.02 0.03 0.04 0.05 

1 /N 

FIG. 2: Estimates of I (2) (X,Y) - I cxact {X,Y) for Gaussians 
with unit variance and covariances r = 0.9, 0.6, 0.3, and 0.0 
(from top to bottom), plotted against \/N . In all cases k = 1. 
The number of realizations is > 2 x 10 6 for N <= 1000, and 
decreases to ~ 10 5 for N — 40, 000. Error bars are smaller 
than the sizes of the symbols. 



redundancies. We thus have the 

Conjecture: Eqs.© and @ are exact for indepen- 
dent X and Y, i.e. I™ (X, Y) = (X, Y) = if and 
only if I(X : Y) = 0. 

We have no proof for this very surprising result. We 
have numerical indications that moreover 

| J .M, ( x,r ) -/(x,r ) | ScoMt (12) 

as X and Y become more and more independent, but 
this is much less clean and therefore much less sure. 

In Sec. II we shall give formal arguments for our es- 
timators, and for generalizations to higher dimensions. 
Detailed numerical results for cases where the exact MI 
is known will be given in Sec. III. In Sec. IV. A we give two 
preliminary applications to gene expression data and to 
ICA. Conclusions are drawn in the last section, Sec.V. 
Finally, some general aspects of MI are recalled in an 
appendix. 



II. FORMAL DEVELOPMENTS 

A. Kozachenko - Leonenko Estimate for Shannon 
Entropies 

We first review the derivation of the Shannon entropy 
estimate [TH |2(j, |2l|, |23 , since the estimators for MI are 
obtained by very similar arguments. 

Let X be a continuous random variable with values 
in some metric space, i.e. there is a distance function 
\\x — x'\ \ between any two realizations of X, and let the 
density [i(x) exist as a proper function. Shannon entropy 
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is defined as 



H(X) = — I dxfi(x) log /z(x) , 



(13) 



where "log" will always mean natural logarithm so that 
information is measured in natural units. Our aim is to 
estimate H(X) from a random sample {x\ . . . Xjv) of N 
realizations of X. 

The first step is to realize that Ea. 1)13(1 can be under- 
stood (up to the minus sign) as an average of log/j(x). 

If we had unbiased estimators log fi(x) of the latter, we 
would have an unbiased estimator 



JV 



H(X) = -N- 1 Y,log f i(x i ) 



(14) 



In order to obtain the estimate log/i(xi), we consider 
the probability distribution Pk(e) for the distance be- 
tween Xi and its fe-th nearest neighbour. The proba- 
bility Pk(e)de is equal to the chance that there is one 
point within distance r € [e/2, e/2 + de/2] from Xi, that 
there are k — 1 other points at smaller distances, and 
that N — k — 1 points have larger distances from Xf.- Let 
us denote by Pi the mass of the e-ball centered at Xj, 
Pi( e ) — J\\£—x •||<e/2 ^^(0- Using the trinomial formula 
we obtain 



Pk(e)de 



(N-l)\ 



l!(fc- 1)\(N- k- 1)! 

dpi(e) 



de x p\ 1 x (1 - ft) 



N-k-l 



(15) 



iV- 1 
fc 



dp l {e) 
de 



N-k-l 



(16) 



One easily checks that this is correctly normalized, 
J dePk(e) — 1. Similarly, one can compute the expec- 
tation value of log ^ (e) 



E(logp 4 ) 



de P fe (e) log pi (e) 



N —1 



[ <fj>p fc - 1 (l-p)*-*- 1 logp 
Jo 



(17) 



where t/K 2 -) is the digamma function. The expectation is 
taken here over the positions of all other N — 1 points, 
with Xi kept fixed. An estimator for log[i(x) is then 
obtained by assuming that n(x) is constant in the entire 
e-ball. The latter gives 



Pi(e) « c d e d n(xi) 



(18) 



where <i is the dimension of x and c d is the volume of 
the d-dimensional unit ball. For the maximum norm one 
has simply c d = 1, while c d = tt^ 2 /T{1 + d/2)/2 d for 
Euclidean norm. 



Using Eas. l|17|) and l|18|) one obtains 
\ogfi(x t ) « i/>(k) - i>(N) - d E(loge) - logc d , (19) 
which finally leads to 

d N 

H{X) =-Tp( k ) + i>(N) + log c d + - log e(i) (20) 

i=i 

where e(i) is twice the distance from Xj to its fc-th neigh- 
bour. 

From the derivation it is obvious that Eq. (|20|l would be 
unbiased, if the density fJ.(x) were strictly constant. The 
only approximation is in Ea. (|18fl . For points on a torus 
(e.g. when x is a phase) with a strictly positive density 
one can easily estimate the leading corrections to Eq. i|18|) 
for large N. One finds that they are 0(l/N 2 ) and that 
they scale, for large k and N, as ~ (k/N) 2 . In most 
other cases (including e.g. Gaussians and uniform den- 
sities in bounded domains with a sharp cut-off) it seems 
numerically that the error is ~ k/N or ~ k/N\og(N/k). 

B. Mutual Informations: Estimator I^\X,Y) 

Let us now consider the joint random variable Z = 
(X, Y) with maximum norm. Again we take one of the 
N points Zi and consider the distance e/2 to its fc-th 
neighbour. Again this is a random variable with dis- 
tribution given by Eq. ljPo|) . Also Ea. lfT7jl holds without 
changes. The first difference to the previous subsection is 
in Eq. IjlSjl , where we have to replace d by dz = dx + dy , 
Cd by c dx c d y, and of course Xi by = (xj,yi). With 
these modifications we obtain therefore 



H(X,Y) = iP(k)-ip(N)-log(c dx c dY ) 



N 



^loge(i) 



(21) 



In order to obtain I(X,Y), we have to subtract this 
from estimates for H(X) and H(Y). For the latter, we 
could use Eq.JSOjl directly with the same k. But this 
would mean that we would effectively use different dis- 
tance scales in the joint and marginal spaces. For any 
fixed k 7 the distance to the fc-th neighbour in the joint 
space will be larger than the distances to the neighbours 
in the marginal spaces. Since the bias in Ea. l2Ufl from the 
non-uniformity of the density depends of course on these 
distances, the biases in H(X), H(Y), and in H(X,Y) 
would not cancel. 

To avoid this, we notice that Ea. pOJI holds for any 
value of k, and that we do not have to choose a fixed 
k when estimating the marginal entropies. Assume, as 
in Fig. la, that the fc-th neighbour of Xj is on one of 
the vertical sides of the square of size e(i). In this case, 
if there are altogether n x (i) points within the vertical 
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lines x = Xi ± e(i)/2, then e(i)/2 is the distance to the 
(n x (i) + 1)— st neighbour of x^, and 

1 N 



(22) 



For the other direction (the y direction in Fig. la) this 
is not exactly true, i.e. e(i) is not exactly equal to 
twice the distance to the (n y (i) + 1)— st neighbour, if 
n y (i) is analogously defined as the number of points with 
\\Uj ~ Vi\\ < e (*)/2- Nevertheless we can consider Ea. l|22JI 
also as a good approximation for H(Y), if we replace 
everywhere X by Y in its right hand side (this approx- 
imation becomes exact when n y (i) — ► oo, and thus also 
when N — > oo). If we do this, subtracting H(X, Y) from 
H{X) + H(Y) leads directly to Eq.©. 

These arguments can be easily extended to m random 
variables and lead to 



lW(X 1 ,X 2 ,...X ri 



ip{k) + (m — l)ip(N) (23) 
(ip(n xi ) + ip{n X2 ) + . ..tp(n Xm )) . 



C. Mutual Informations: Estimator I {2) (X, Y) 

The main drawback of the above derivation is that the 
Kozachenko-Leonenko estimator is used correctly in only 
one marginal direction. This seems unavoidable if one 
wants to stick to "balls", i.e. to (hyper-)cubes in the 
joint space. In order to avoid it we have to switch to 
(hyper-)rectangles. 

Let us first discuss the case of two marginal variables 
X and Y, and generalize later to m variables X%, . . . X m . 
As illustrated in Figs, lb and lc, there are two cases to be 
distinguished (all other cases, where more points fall onto 
the boundaries Xi ± e x (i)/2 and yi ± e y (i)/2, have zero 
probability; see however the third paragraph of Sec. Ill): 
Either the two sides e x (i) and e y (i) are determined by the 
same point (Fig. lb), or by different points (Fig. lc). In 
either case we have to replace P k (e) by a 2-dimensional 
density, 



Pk(e x ,e y ) = Pl h \e x ,e y )+Pl c \e x ,e y ) (24) 



with 



r k vanity) 



N-l\ d 2 [qf] 
k J de x de v 



N-k-l 



(25) 



and 



(l _ 1) (»-')M (1 _ p , r -. 

\ k J de x de y 



Here, qi = qi(e x ,e y ) is the mass of the rectangle of size 
e x x e y centered at (a;,-, yi), and pi is as before the mass of 



the square of size e = maxje^, e y }. The latter is needed 
since by using the maximum norm we guarantee that 
there are no points in this square which are not inside 
the rectangle. 

Again we verify straightforwardly that Pk is normal- 
ized, while we have now instead of Ea. (|17f) 



E (log ft 



de x de y P k (e xl e y ) logqi(e x ,e y ) 



^{k)-l/k-^{N) . (27) 



Denoting now by n x (i) and n y (i) the number of points 
with distance less or equal to e x (i)/2 resp. e y (i)/2, we 
arrive at Eq. Q . 

For the generalization to m variables we have to con- 
sider m-dimensional densities Pk(e Xl , . . . e Xm ). The num- 
ber of distinct cases (analogous to the two cases shown in 
Figs, lb and lc) proliferates as m grows, but fortunately 
we do not have to consider all these cases explicitly. One 
sees easily that each of them contributes to Pk a term 



oc (I- Pi) 

de Xl . . . de Xm 



N-k-l 



(28) 



The direct calculation of the proportionality factors 
would be extremely tedious (we did it for m = 3), but 
it can be avoided by simply demanding that the sum is 
correctly normalized. This gives 



Pk(e Xl , . . . e Xm ) = k r 



_ 1 /iV-l\ d m [q k ^ 



dc. 



x 



N-k-l 



(29) 



Calculating again E(logft) = ip(k) — (m — l)/k — i/j(N) 
analytically and approximating the density by a constant 
inside the hyper-rectangle, we obtain finally 

lW(X u X 2 ,...X m ) = V(fc)-(m-l)/fc 

+ (m-l)V'(A) (30) 
- (ip(n xi ) + tlj(n X2 ) + . ..ip(n Xm )) . 

Before leaving this section, we should mention that we 
slightly cheated in deriving (X, Y) (and its general- 
ization torn > 2). Assume that in a particular realization 
we have e x (i) < e y (i), as in Fig. lb,c. In that case we 
know that there cannot be any point in the two rectan- 
gles [xi-€y(i)/2,Xi-€ x (i)/2] x [yi-€ y (i)/2,yi + e y (i)/2] 
and [xi + e x (i)/2,Xi + e y (i)/2] x [y t -e y (i)/2,yi + e y (i)/2] 
(see Fig. 3). While we have taken this correctly into ac- 
count when estimating H(X, Y) (where it was crucial), 
we have neglected it in H(X) and H(Y). There, the cor- 
rections are 0(l/n x ) and 0(l/n y ), and should vanish for 
N — ► oo. It could be that their net effect vanishes, be- 
cause they contribute with opposite signs to H{X) and 
H(Y). But we have no proof for it. Anyhow, due to the 
approximation of constant density within each rectangle 
we cannot expect our estimates to be exact for finite N, 
and any justification ultimately relies on numerics. 
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FIG. 3: There cannot 
be any points inside the 
shaded rectangles. For 
method 2, this means 
that the estimates of the 
marginal entropy H(X) 
(H(Y)) should be modi- 
fied, since part of the area 
outside (inside) the stripe 
of with e x (e y ) is forbid- 
den. This is neglected in 
Eq.gJ. 



III. IMPLEMENTATION AND RESULTS 
A. Some Implementation Details 

Mutual information is invariant under reparametriza- 
tion of the marginal variables. If X' — F(X) and Y' = 
G(Y) are homeomorphisms, then I(X,Y) = I(X',Y') 
(see appendix). This is in contrast to H(X) which 
changes in general under a homeomorphism. This can 
be used to rescale both variables first to unit variance. 
In addition, if the distributions are very skewed and/or 
rough, it might be a good idea to transform them such 
as to become more uniform (or at least single-humped 
and more or less symmetric). Although this is not re- 
quired, strictly spoken, it will in general reduce errors. 
One example is the gamma-exponential distribution in 2 
variables, /x(x,y) = x e exp(— x — xy)/T{8) for x,y > 
[23|. when 9 < 1. For 9^0, the marginal distributions 
develop l/x resp. 1/y singularities (for x — ► and for 
y — > oo, respectively), and the joint distribution is non- 
zero only in a very narrow region near the two axes. In 
this case our algorithm failed when applied directly, but 
it gave excellent results after transforming the variables 
to x' = logx and y' = logy. 

When implemented straightforwardly, the algorithm 
spends most of the CPU time for searching neighbours. 
In the most naive version, we need two nested loops 
through all points which gives a CPU time 0(N 2 ). While 
this is acceptable for very small data sets (say N < 300) , 
fast neighbour search algorithms are needed when deal- 
ing with larger sets. Let us assume that X and Y are 
scalars. An algorithm with complexity 0(Ny/ k N) is 
then obtained by first ranking the Xi by magnitude (this 
can be done by any sorting algorithm such as quicksort), 
and co-ranking the yi with them |2J]. Nearest neigh- 
bours of (xi,yi) can then be obtained by searching x- 
neighbours on both sides of X4 and verifying that their 
distance in y direction is not too large. Neighbours in 
the marginal subspaces are found even easier by ranking 
both Xi and yi. Most results in this paper were obtained 
by this method which is suitable for N up to a few thou- 
sands. The fastest (but also most complex) algorithm is 
obtained by using grids ('boxes') |2^,[2g. Indeed, we use 



three grids: A 2-dimensional one with box size 0{^k/N) 
and two 1-dimensional ones with box sizes 0(1/N). First 
the k neighbours in 2-d space are searched using the 2- 
d grid, then the boxes at distances ±e from the central 
point are searched in the 1-d grids to find n x and n y . 
If the distributions are smooth, this leads to complex- 
ity 0(V~k~N). The last algorithm is comparable in speed 
to the algorithm of For all three versions of our al- 
gorithm it costs only little additional CPU time if one 
evaluates, together with I(X, Y) for some k > 1, also the 
estimators for smaller k. 

Empirical data usually are obtained with few (e.g. 12 
or 16) binary digits, which means that many points in 
a large set may have identical coordinates. In that case 
the numbers n x (i) and n y (i) need no longer be unique 
(the assumption of continuously distributed points is vi- 
olated). If no precautions are taken, any code based on 
nearest neighbour counting is then bound to give wrong 
results. The simplest way out of this dilemma is to add 
very low amplitude noise to the data (~ 10~ 10 , say, when 
working with double precision) which breaks this degen- 
eracy. We found this to give satisfactory results in all 
cases. 

Often, MI is estimated after rank ordering the data, i.e. 
after replacing the coordinate x% by the rank of the z-th 
point when sorted by magnitude. This is equivalent to 
applying a monotonic transformation x — > x' , y — ► y' to 
each coordinate which leads to a strictly uniform empiri- 
cal density, n' x {x') = ^ y {x') = (l/N) E^Ii $( x ' ~ *)• For 
N — > 00 and k ^> 1 this clearly leaves the MI estimate 
invariant. But it is not obvious that it leaves invariant 
also the estimates for finite k, since the transformation 
is not smooth at the smallest length scale. We found nu- 
merically that rank ordering gives correct estimates also 
for small k, if the distance degeneracies implied by it are 
broken by adding low amplitude noise as discussed above. 
In particular, both estimators still gave zero MI for inde- 
pendent pairs. Although rank ordering can reduce statis- 
tical errors, we did not apply it in the following tests, and 
we did not study in detail the properties of the resulting 
estimators. 

B. Results: Two-Dimensional Distributions 

We shall first discuss applications of our estimators 
to correlated Gaussians, mainly because we can in this 
way most easily compare with analytic results and with 
previous numerical analyses. In all cases we shall deal 
with Gaussians of unit variance and zero mean. For to 
such Gaussians with covariance matrix i,k = 1 ... to, 
one has 

I(X 1 ,...X m ) = ~log(det(a)) . (31) 

For to = 2 and using the notation r = cxy, this gives 
Eq.QJ. 

First results for I {2 \X,Y) with k = 1 were already 
shown in Fig. 2. Results obtained with I^(X,Y) are 



7 



79 
0.78 




0.01 0.02 0.03 0.04 0.05 0.06 0.07 
k/N 



0.01 0.02 0.03 0.04 0.05 0.06 0.07 
(k - 0.5) / N 



FIG. 4: Mutual information estimates I W (X,Y) (left panel) 
and (X, Y) (right panel) for Gaussian deviates with 
unit variance and covariance r = 0.9, plotted against 
k/N (left panel) resp. (k - 1/2) /N (right panel). Each 
curve corresponds to a fixed value of N, with N = 
125, 250, 500, 1000, 2000, 4000, 10000 and 20000, from bottom 
to top. Error bars are smaller than the size of the symbols. 
The dashed line indicates the exact value I(X, Y) = 0.830366. 



very similar and would indeed be hard to distinguish on 
this figure. In Fig.Q]we compare values of I^ 1 ' (X, Y) (left 
panel) with those for 1^ (X, Y) (right panel) for different 
values of N and for r = 0.9. The horizontal axes show 
k/N (left) and (k-l/2)/N (right). Except for very small 
values of k and N, we observe scaling of the form 



lW(X,Y) m , 7 (2) (A,y)«$( 



1/2, 



N 



(32) 



This is a general result and is found also for other distri- 
butions. The scaling with k/N of I^(X, Y) results sim- 
ply from the fact that the number of neighbours within a 
fixed distance would scale oc N, if there were no statisti- 
cal fluctuations. For large k these fluctuations should be- 
come irrelevant, and thus the MI estimate should depend 
only on the ratio k/N. For I^(X,Y) this argument has 
to be slightly modified, since the smaller one of e x and e y 
is determined (for large k where the situation illustrated 
in Fig. lc dominates over that in Fig. lb) by k— 1 instead 
of k neighbours. 

The fact that 1^ (X, Y) for a given value of k is be- 
tween (X, Y) for k — 1 and (X, Y) for k is also seen 
from the variances of the estimates. In Fig.[5Jwe show the 
standard deviations, again for covariance r — 0.9. These 
statistical errors depend only weakly on r. For r = 
they are roughly 10% smaller. As seen from Fig. [5J the 
errors of 1^ (X, Y; k) are roughly half-way between those 
of lW(X,Y;k-l) and I^(X, Y; k). They scale roughly 
as ~ \/N, except for very large k/N. Their dependence 
on k does not follow a simple scaling law. The fact that 
statistical errors increase when k decreases is intuitively 
obvious, since then the width of the distribution of e in- 
creases too. Qualitatively the same dependence of the 
errors was observed also for different distributions. For 
practical applications, it means that one should use k > 1 
in order to reduce statistical errors, but too large values 
of k should be avoided since then the increase of system- 
atic errors outweighs the decrease of statistical ones. We 
propose to use typically k = 2 to 4, except when testing 
for independence. In the latter case we do not have to 
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FIG. 5: Standard deviations of the estimates I^\X,Y) 
(+) and I^ 2 \X, Y) (x) for Gaussian deviates with unit 
variance and covariance r — 0.9, multiplied by \fN and 
plotted against k (I {1) (X,Y)) resp. k- 1/2 (J (2) (X,Y)). 
Each curve corresponds to a fixed value of N, with TV = 
125, 250, 500, 1000, 2000, 4000, 10000 and 20000, from bottom 
to top. 



worry about systematic errors, and statistical errors are 
minimized by taking k to be very large (up to k ~ N/2, 
say). 

The above shows that /W (X, Y) and 1^ {X, Y) be- 
have very similar. Also CPU times needed to estimate 
them are nearly the same. In the following, we shall only 
show data for one of them, understanding that every- 
thing holds also for the other, unless the opposite is said 
explicitly. 

For N — > oo, the systematic errors tend to zero, as 
they should. From Figs. 2 and 4 one might conjecture 
that /t 1 ' 2 ) (X, Y) — J oxact {X, Y) ~ TV" 1 / 2 , but this is not 
true. Plotting this difference on a double logarithmic 
scale (Fig. |SJ, we see a scaling ~ N~ 1 / 2 for N w 10 3 , 
but faster convergence for larger N . It can be fitted by 
a scaling ~ I/TV - 85 for the largest values of N reached 
by our simulations, but the true asymptotic behaviour is 
presumably just ~ 1/N. 

As said in the introduction, the most surprising fea- 
ture of our estimators is that they seem to be ex- 
act for independent random variables X and Y. In 
Fig. \7\ we show how the relative systematic errors be- 
have for Gaussians when r — * 0. More precisely, we show 
/(i.a)^ Y)/l£®t(X, Y) for k = 1, plotted against N for 
four different values of r. Obviously these data converge, 
when r — > 0, to a finite function of N. We have observed 
the same also for other distributions, which leads to a 
conjecture stronger than the conjecture made in the in- 
troduction: Assume that we have a one-parameter family 
of 2-d distributions with densities n(x,y;r), with r being 
a real-valued parameter. Assume also that y, factorizes 
for r = ro, and that it depends smoothly on r in the 
vicinity of rrj, with dfi(x,y;r)/dr finite. Then we pro- 
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pose that for many distributions (although not for all!) 

I^(X,Y)/I cxact (X,Y) -» F(fc.JV) (33) 

for r — * rrj, with some function -F^Zc, iV) which is close to 
1 for all k and all N ^> 1, and which converges to 1 for 
N — > oo. Wc have not found a general criterion for which 
families of distributions we should expect Eq. . 

The most precise and efficient previous algorithm for 
estimating MI is the one of Darbellay & Vajda As 
far as speed is concerned, it seems to be faster than 
the present one, which might however be due to a 
more efficient implementation. In any case, also with 
the present algorithm we were able to obtain extremely 
high statistics on work stations within reasonable CPU 
times. To compare our statistical and systematic errors 
with those of we have used the code basic.exe from 
http://siprint.utia.cas.cz/timeseries/. We used the param- 
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FIG. 8: Statistical errors (1 standard deviation) for Gaus- 
sian deviates with r = 0.9, plotted against N . Results from 
I^ 2 '(X, Y)) for k = 1 (full line) are compared to theoretically 
predicted (dashed) and actually measured (dotted line) errors 
from H. 



eter settings recommended in its description. 

This code provides an estimate of the statistical er- 
ror, even if only one data set is provided. When running 
it with many (typically w 10 4 ) data sets, we found that 
these error bars are always underestimated, sometimes by 
rather large margins. This seems to be due to occasional 
outliers which point presumably to some numerical in- 
stability. Unfortunately, having no source code we could 
not pin down the troubles. In Fig. [S] we compare the 
statistical errors provided by the code of Q, the errors 
obtained from the variance of the output of this code, and 
the error obtained from I^(X,Y) with k = 3. We see 
that the latter is larger than the theoretical error from 
0, but smaller than the actual error. For Gaussians with 
smaller correlation coefficients the statistical errors of 
decrease strongly with r, because the partitionings are 
followed to less and less depth. But, as we shall see, this 
comes with a risk for systematic errors. 

Systematic errors of 8] for Gaussians with various val- 
ues of r are shown in Fig. Comparing with Fig. 2 we 
see that they are, for r^0, about an order of magnitude 
larger than ours, except for very large N where they seem 
to decrease as 1/N. Systematic errors of y| are also very 
small when r — 0, but this seems to result from fine tun- 
ing the parameter <5» which governs the pruning of the 
partitioning tree in y| ■ Bad choices of S s lead to wrong 
MI estimates, and optimal choices should depend on the 
problem to be analyzed. No such fine tuning is needed 
with our method. 

As examples of non-Gaussian distributions we studied 

• The gamma-exponential distribution |28| : 

• The ordered Weinman exponential distribution 

El 

• The "circle distribution" ofRef.El. 
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FIG. 9: Systematic errors for Gaussian deviates with r = 
0.0,0.3,0.6, and 0.9, plotted against 1/7V, obtained with the 
algorithm of § . These should be compared to the systematic 
errors obtained with the present algorithm shown in Fig. 2. 
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FIG. 10: Ratios I(X, V ) cst i m / /exact {X, Y) for the gamma- 
exponential distribution, plotted against 1/N. These data 
were obtained with J*- 2 - 1 using k = 1, after transforming Xi 
and iji to their logarithms. The five curves correspond to 
9 = 0.1, 0.3, 1.0, 2.0, 10.0, and 100.0 (from bottom to top). 



For all these, both exact formulas for the MI and de- 
tailed simulations using the Darbellay-Vajda algorithm 
exist. In addition we tested that JW and 1^ vanish, 
within statistical errors, for independent uniform distri- 
butions, for exponential distributions, and when X was 
Gaussian and Y was either uniform or exponentially dis- 
tributed. Notice that 'uniform' means uniform within a 
finite interval and zero outside, so that the Kozachenko- 
Leonenko estimate is not exact for this case either. 

In all cases with independent X and Y we found that 
/ (1 < 2) (X, Y) = within the statistical errors (which typi- 
cally were « 1CP 3 to 10~ 4 ). We do not show these data. 

The gamma-exponential distribution depends on a pa- 
rameter 9 (after a suitable re-scaling of x and y) and is 
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FIG. 11: Ratios I(X,Y) cstirn / I c ^ ct (X,Y) for the gamma- 
exponential distribution, plotted against 1/N. Full lines are 
from estimator I^ 2 \ dashed lines are from |2g|. Our data 
were obtained with k — 1 after a transformation to loga- 
rithms. The four curves correspond to 8 — 0.1,0.3,2.0, and 
100.0 (from bottom to top for our data, from top to bottom 
for the data of H3). 



defined \M as 



fj.(x,y;9) 



T(9) 



(34) 



for x > and y > 0, and n(x,y\ 9) = otherwise. The 
MI is 28] /(X,y) exact = ${9 + 1) - log9. For 9 > 1 
the distribution becomes strongly peaked at x — and 
y = 0. Therefore, as we already said, our algorithms 
perform poorly for 9 3> 1, if we use Xi and yi them- 
selves. But using x[ — logXi and y[ = logyi we obtain 
excellent results, as seen from Fig. ^] There we plot 
again 1^ (X' , Y')/I(X, F) cxact for k = 1 against 1/N, 
for five values of 9. These data obviously support our 
conjecture that 1^ (X' , Y' )/I(X, Y) cxact tends towards 
a finite function as independence is approached. To com- 
pare with |2~8ll. we show in Fig.llllour data together with 
those of [28|. for the same four values of 9 studied also 
there, namely 9 = 0.1,0.3,2.0, and 100.0. We see that 
MI was grossly underestimated in |2S[ . in particular for 
large 9 where I(X,Y) is very small (for 9 ^S> 1, one has 
I(X,Y)^l/29). 

The ordered Weinman exponential distribution de- 
pends on two continuous parameters. Following [2^ we 
consider here onl y t he case where one of these parame- 
ters (called #o in HH) is se t equal to 1, in which case the 
density is 



-2x-{y-x)/e 



(35) 



for x > and y > 0, and /i(x,y;9) — otherwise. The 
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FIG. 12: Ratios I(X, Y)estim //exact (X, Y) for the ordered 
Weinman exponential distribution, plotted against 1/N. Full 
lines are from estimator i' 2 ', dashed lines are from |28|. Our 
data were obtained with k = 1 after a transformation to log- 
arithms. The four curves correspond to 9 = 0.1,0.3, 1.0, and 
100.0 (from top to bottom). 
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FIG. 13: Values of I(X, Y) for the circle distribution of 
Ref . |27| . plotted against the inner radius parameter a. Full 
lines are from estimator with k = 1 and with N = 1000 
(bottom) and N = 10000 (top). Dashed lines are estimates 
from Ref.H3 with N = 1000 and N = 10000, and the semi- 
analytical result of Ref. |27|l (also from bottom to top). 
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Mutual information estimates using I^{X,Y) with k = 
1 are shown in Fig. El Again we transformed (xi,yi) — > 
(logXi,logyi) since this improved the accuracy, albeit 
not as much as for the gamma-exponential distribution. 
More precisely, we plot I^ 2 \X, Y)/I(X, Y) cxact against 
l/N for the same four values of 9 studied also in |28| . 
and we plot also the estimates obtained in [2^. We see 
that MI was severely underestimated in |2^, in particu- 
lar for large 9 where the MI is small (for 9 — > 00, one has 
I(X,Y) « - l)/29 = 0.32247/6*). Our estimates 

are also too low, but much less so. It is clearly seen that 
I^(X\ Y')/I{X, F) cxac t decreases for 9 — ► 00 in contra- 
diction to the above conjecture. This represents the only 
case where the conjecture does not hold numerically. As 
we already said, we do not know which feature of the 
ordered Weinman exponential distribution is responsible 
for this difference. 

The 'circle distribution' of Ref.|22j is defined in terms 
of polar coordinates (r, <p) as uniform in 0, and with a 
triangular radial profile: The radial distribution pn(r) 
vanishes for r < a and r > 1, is maximal at r = (l+a)/2, 
and is linear in the intervals [a, (l+a)/2] and [(l+a)/2, 1]. 
The variables X and Y are obtained as X — r cos 4> and 
Y = rsincj). In Ref.[27j it was shown that its MI can 
be calculated analytically except for one integral which 
has to be done numerically. Tables for the exact and esti- 



mated values of MI in the range 0.1 < a < 0.9 are given in 
Ref. [13, from which it appears that the Darbellay- Vajda 
algorithm is very precise for this case. Unfortunately, our 
own estimates (using again 1^ with k = 1) are in serious 
disagreement with them, see Fig. Moreover, the val- 
ues quoted in Ref.|23] seem to converge to I(X, Y) — > 
for a — > which is impossible (X and Y are not inde- 
pendent even for a — 0). We have no explanation for 
this. In any case, our estimates are rapidly convergent 
for N — ► 00. 



C. Higher Dimensions 

In higher dimensions we shall only discuss applications 
of our estimators to m correlated Gaussians, because as in 
the case of two dimensions this is easily compared to an- 
alytic results (Eq. l|31|) ) and to previous numerical results 
|29| . As already mentioned in the introduction and as 
shown above for 2-d distributions (Fig. 0) our estimates 
seem to be exact for independent random variables. We 
choose the same one-parameter family of 3-d Gaussian 
distributions with all the correlation coefficients equal to 
r as in 29] . In Fig. ^] we show the behavior of the rela- 
tive systematic errors of both proposed estimators. One 
can easily see that the data converge for r — > 0, i.e. when 
all three Gaussians become independent. This supports 
the conjecture made in the previous subsection. In ad- 
dition, in Fig. E| one can see the difference between the 
estimators 1^ and I^ 2 \ For intermediate numbers of 
the points, N ~ 100 — 200, the "cubic" estimator has 
lower systematic error. Apart from that, 1^ evaluated 
for N is roughly equal to I^- 1 ' evaluated for 2N, reflecting 
the fact that effectively uses smaller length scales as 
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FIG. 14: Ratios (X,Y, Z) / I c ^ ct {X,Y, Z) for k = 1 plot- 
ted against 1/N, for 4 different values of r. All Gaussians 
have unit variance and all non-diagonal elements in the cor- 
relation matrix <Ti,k,i 7^ k (correlation coefficients) take the 
value r. 
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FIG. 16: Standard deviations of the estimate i' 1 ' for Gaussian 
deviates with unit variance and covariance r = 0.9, multiplied 
by 2^H. and plotted against k. Each curve corresponds to a 
fixed value of dimension m. Number of samples is N = 10000. 
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FIG. 15: Averages of 7 (1 ' 2) (Xi, (X 2 ...X m )) for k = 1 plotted 
against d, for 3 different values of r = 0.1, 0.5, 0.9. The sample 
size is 50000, ave rag ing is done over 100 realizations (same 
parameters as in |29(l . Fig. 1). Full lines indicate theoretical 
values, pluses (+) are for l' 1 ', crosses (x) for 7^ 2 '. Squares 
and dotted lines are read off from Fig. 1 of Ref . |2fi|l . 



discussed already for d = 2. 

To compare our results in high dimension with 
the ones presented in |2jE| we shall calculate not the 
high dimensional redundancies I{X\, X2, X m ) but the 
MI /((Xl,-X~2, X m _i),X m ) between two variables, 
namely anm-1 dimensional vector and a scalar. For 
estimation of this MI we can use the formulas as for 
the 2-d case (Eq.JSJ and Eq.©, respectively) where 
n x would be defined as the number of points in the 
(to — l)-dimensional stripe of (hyper-)cubic cross sec- 
tion. Using directly Ea. (|46|l would increase the errors 
in estimation (see the appendix for the relation between 



I{X 1 ,X 2 ,...,X m ) and I((Xi,X 2 ,...,X m -i),X m )). 

In Fig. E| we show the average values of I^' 2 \ They 
are in very good agreement with the theoretical ones for 
all three values of the correlation coefficient r and all 
dimensions tested here (in contrast, in [2?j the estimators 
of MI significantly deviate from the theoretical values for 
dimensions > 6). It is impossible to distinguish (on this 
scale) between estimates and I^ 2 \ 

In Fig. E| statistical errors of our estimate are pre- 
sented as a function of the number of neighbours fc. More 
precisely, we plotted the standard deviation of JW mul- 
tiplied by against k for the case where all correlation 
coefficients are r — 0.9. Each curve corresponds to a dif- 
ferent dimension m. The data scale roughly as ~ -7= for 
large dimension. Moreover, these statistical errors seem 
to converge to finite values for k — > 00. This convergence 
becomes faster for increasing dimensions. The same be- 
havior is observed for 1^ . 



IV. APPLICATIONS: GENE EXPRESSION 
DATA AND INDEPENDENT COMPONENT 
ANALYSIS 

A. Gene Expression 

In the first application to real world data, we compare 
our MI estimators to kernel density estimators (KDE) 
made in The data there are gene expression data 
from [30(. They consist of sets of 1=3 300 vectors in a 
high dimensional space, each point corresponding to one 
genome and each dimension corresponding to one open 
reading frame (ORF). Mutual informations between data 
corresponding to pairs of ORFs (i.e., for 2-d projections 
of the set of data vectors) are estimate in to improve 
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eventually the hierarchical clustering made in |30| . It was 
found that KDE performed much better than estimators 
based on binning, but that the estimated Mis were so 
strongly correlated to linear correlation coefficients that 
they hardly carried more useful information. 

Here we re- investigate just the MI estimates of the four 
ORF pairs "A" to "D" shown in Figs. 3, 5, and 7 of @. 
The claim that KDE was superior to binning was based 
on a surrogate analysis. For surrogates consisting of com- 
pletely independent pairs, KDE was able to show that 
all four pairs were significantly dependent, while binning 
based estimators could disprove the null hypothesis of 
independence only for two pairs. In addition, KDE had 
both smaller statistical and systematic errors. Both KDE 
and binning estimators were applied to rank-ordered data 


In KDE, the densities are approximated by sums of 
N Gaussians with fixed prescribed width h centered at 
the data points. In the limit h — ► the estimated MI 
diverges, while it goes to zero for h — ► oo. Our main crit- 
icism of is that the authors used a very large value of 
h (roughly 1/2 to 1/3 of the total width of the distribu- 
tion). This is recommended in the literature [3l|], since 
both statistical and systematic errors would become too 
large for smaller values of h. But with such a large value 
of h one is insensitive to finer details of the distributions, 
and it should not surprise that hardly anything beyond 
linear correlations is found by the analysis. 

With our present estimators J« and we found 
indeed considerably larger statistical errors, when using 
small values of k (k < 10, say). But when using k « 50 
(corresponding to y/k/N ~ 0.4, similar to the ratio h/a 
used in |(|) the statistical errors were comparable to those 
in . Systematic errors could be estimated by using the 
exact inequality Eq. (|48|l given in the appendix (when ap- 
plying this, one has of course to remember that the esti- 
mate of the correlation coefficient contains errors which 
lead to systematic overestimation of the r.h.s. of Eq.l|48|l 
@). For instance, for pair "B" one finds I(X,Y) > 1.1 
from Eq. (|48Jl . While this is satisfied for k < 5 within the 
expected uncertainty, it is violated both by the estimate 
of @ {I{X, Y) w 0.9) and by our estimate for k = 50 
(I(X, Y) « 0.7). With our method and with k w 50, we 
could also show that none of the four pairs is indepen- 
dent, with roughly the same significance as in 

Thus the main advantage of our method is that it does 
not deteriorate as quickly as KDE does for high reso- 
lution. In addition, it seems to be faster, although the 
precise CPU time depends on the accuracy of the inte- 
gration needed in KDE. In |6( also a simplified algorithm 
is given (Eq.(33) of [6J) where the integral is replaced by 
a sum. Although it is supposed to be faster that the al- 
gorithm involving numerical integration (on which were 
based the above estimates), it is much slower that our 
present estimators (it is 0(N 2 ) and involves the eval- 
uation of 3A^ 2 exponential functions). This simplified 
algorithm (which is indeed just a generalized correla- 
tion sum with the Heaviside step function replaced by 



Gaussians) gives also rather big systematic errors, e.g. 
I(X,Y) = 0.66 for pair "B". 

B. ICA 

Independent component analysis (ICA) is a statistical 
method for transforming an observed multi-component 
data set (e.g. a multivariate time series comprising 
n measurement channels) x(<) = (xi(t), xzfy), x n (t)) 
into components that are statistically as independent 
from each other as possible 4] . In the simplest case, x(i) 
could be a linear superposition of n independent sources 
s(t) = (si(t),s 2 (*), ...,s„(t)), 

x(t) = A s(i) , (37) 

where A is a non-singular nx n 'mixing matrix'. In that 
case, we know that a decomposition into independent 
components is possible, since the inverse transformation 

s(i) = Wx(t) with W = A" 1 (38) 

does exactly this. If Ea. (|37|) does not hold, then no de- 
composition into strictly independent components is pos- 
sible by a linear transformation like Eq. I|38|l , but one can 
still search for least dependent components. In a slight 
misuse of notation, this is still called ICA. 

But even if Ea. (|37|l does hold, the problem of blind 
source separation (BSS), i.e. finding the matrix W with- 
out explicitly knowing A, is not trivial. Basically, it re- 
quires that x is such that all superpositions s' = W'x 
with W =/= W are not independent. Since linear combi- 
nations of Gaussian variables are also Gaussian, BSS is 
possible only if the sources are not Gaussian. Otherwise, 
any rotation (orthogonal transformation) s' = Rs would 
again lead to independent components, and the original 
sources s could not be uniquely recovered. 

This leads to basic performance tests for any ICA prob- 
lem: 

1) How independent are the found "independent" com- 
ponents? 

2) How unique are these components? 

3) How robust are the estimated dependencies against 
noise, against statistical fluctuations, and against out- 
liers? 

4) How robust are the estimated components! 
Different ICA algorithms can then be ranked by how 

well they perform, i.e. whether they find indeed the most 
independent components, whether they declare them as 
unique if and only if they indeed are, and how robust 
are the results. While questions 2 and 4 have often been 
discussed in the ICA literature (for a particularly inter- 
esting recent study, see H2), the first (and most basic, in 
our opinion) test has not attracted much interest. This 
might seem strange since MI is an obvious candidate for 
measuring independence, and the importance of MI for 
ICA was noticed from the very beginning. We believe 
that the reason was the lack of good MI estimators. We 
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propose to use our MI estimators not only for testing the 
actual independence of the components found by stan- 
dard ICA algorithms, but also to use them for testing 
for uniqueness and robustness. We will also show how 
our estimators can be used for improving the decompo- 
sition obtained from a standard ICA algorithm, i.e. for 
finding components which are more independent. Algo- 
rithms which use our estimators for ICA from scratch 
will be discussed elsewhere. 

It is useful to decompose the matrix W into two fac- 
tors, W = RV, where V is a prewhitening that trans- 
forms the covariance matrix into C = VCV T = 1, and 
R is a pure rotation. Finding and applying V is just 
a principal component analysis (PCA) together with a 
rescaling, so the ICA problem proper reduces to finding 
a suitable rotation after having the data prewhitened. In 
the following we always assume that the prewhitening 
(PCA) step has already been done. 

Any rotation can be represented as a product of ro- 
tations which act only in some 2x2 subspace, R = 
R,, : -o). where 



Ry(0)(xi 



with 



(39) 



x i = cos Xi + sin (p Xj, Xa = — sin <p Xi + cos <p xa 



For such a rotation one has (see appendix) 
I(RiMX) - /(X) = I{X[,X'A - I(Xi,Xj) 



(40) 



(41) 



i.e. the change of I(X\ . . . X n ) under any rotation can 
be computed by adding up changes of two- variable Mis. 
This is an important numerical simplification. It would 
not hold if MI is replaced by some other similarity mea- 
sure, and it indeed is not strictly true for our estimates 
/W and I {2) . But we found the violations to be so small 
that Ea.l|4ip can still be used when minimizing MI. 

Let us illustrate the application of our MI estimates 
to a fetal ECG recorded from the abdomen and thorax 
of a pregnant woman (8 electrodes, 500 Hz, 5s). Wc 
chose this data set because it was analyzed by several 
ICA methods [32j, |33j and is available in the web [35j . In 
particular, we will use both jW and 1^ to check and 
improve the output of the JADE algorithm |34[ (which is 
a standard ICA algorithm and was more successful with 
these data than TDSEP [36|, see H2). 

The output of JADE for these data, i.e. the suppos- 
edly least dependent components, are shown in Fig. El 
Obviously channels 1-3 are dominated by the heartbeat 
of the mother, and channel 5 by that of the child. Chan- 
nels 4 and 6 still contain large heartbeat components 
(of mother and child, respectively), but look much more 
noisy. Channels 7-8 seem to be dominated by noise, but 
with rather different spectral composition. The pair- 
wise Mis of these channels are shown in Fig. [^J (left 
panel) 37]. One sees that most Mis are indeed small, 
but the first 3 components are still highly interdepen- 
dent. This could be a failure of JADE, or it could mean 
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FIG. 17: Estimated independent components using JADE. 
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FIG. 18: Left panel: Pairwise Mis between all ICA- 
components obtained by JADE, estimated with J™, ft = 1. 
The diagonal is set to zero. 

Right panel: Pairwise Mis between the optimized channels 
shown in Fig. 1191 



that the basic model does not apply to these compo- 
nents. To decide between these possibilities, we mini- 
mized I(Xi . . . Xg) by means of Eas- ip^ - (|4*TJl . For each 
pair (i,j) with i,j = 1...8 we found the angle which 
minimized I{X' i ,X'^) — I(Xi,Xj), and repeated this al- 
together w 10 times. We did this both for I^ 1%> and 
/( 2 ), with k = 1. We checked that I{Xi . . . X 8 ), calcu- 
lated directly, indeed decreased (from Ij\ DE — 1.782 to 

l£l = 1.160, and from lf ADE = 2.264 to = 1.620). 

The resulting components are shown in Fig. 1191 The 
first 2 components look now much cleaner, all the noise 
from the first 3 channels seems now concentrated in chan- 
nel 3. But otherwise things have not changed very much. 
The pairwise MI after minimization are shown in Fig. 1181 
(right panel) . As suggested by Fig. ^] channel 3 is now 
much less dependent on channels 1 and 2. But the latter 
are still very strongly interdependent, and a linear super- 
position of independent sources as in Eq. (|37Jl can be ruled 
out. This was indeed to be expected: In any oscillating 
system there must be at least 2 mutually dependent com- 
ponents involved, and generically one expects both to be 
coupled to the output signal. 
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FIG. 19: Estimated independent components after minimiz- 
ing I 1 . 



To test for the uniqueness of the decomposition, 
computed the variances 



1 

2^ Jo 



/(RO^PQ,^-))- J(JQ,J0) 



n 2 



where 



(42) 



I{X i ,X j ) = ^-J^ d(f)I(R((f>)(Xi,Xj)) . (43) 

If Cjj is large, the minimum of the MI with respect to ro- 
tations is deep and the separation is unique and robust. 
If it is small, however, BSS can not be achieved since the 
decomposition into independent components is not ro- 
bust. Results for the JADE output are shown in Fig. 1201 
(left panel) , those for the optimized decomposition in the 
right panel of Fig. [3UJ The most obvious difference be- 
tween them is that the first two channels have become 
much more clearly distinct and separable from the rest, 
while channel 3 is less separable from the rest (except 
from channel 5). This makes sense, since channels 3, 
4, 7, and 8 now contain mostly Gaussian noise which is 
featureless and thus rotation invariant after whitening. 
Most of the signals are now contained in channels 5 (fe- 
tus) and in channels 1 and 2 (mother). 

These results are in good agreement with those of [3^ , 
but are obtained with less numerical effort and can be 
interpreted more straightforwardly. 



V. CONCLUSION 

We have presented two closely related families of mu- 
tual entropy estimators. In general they perform very 
similarly, as far as CPU times, statistical errors, and sys- 
tematic errors are concerned. Their biggest advantage 
seems to be in vastly reduced systematic errors, when 
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FIG. 20: Square roots of variances, ^/oTJ, of i"' ((Xi, Xj)) 
(with k — 1) from JADE output (left panel) and after mini- 
mization of MI (right panel). Again, elements on the diagonal 
have been put to zero. 



compared to previous estimators. This allows us to use 
them on very small data sets (even less than 30 points 
gave good results) . It also allows us to use them in inde- 
pendent component analyses to estimate absolute values 
of mutual dependencies. Traditionally, contrast functions 
have been used in ICA which allow to minimize MI, but 
not to estimate its absolute value. We expect that our 
estimators will also become useful in other fields of time 
series and pattern analysis. One large class of problems 
is interdependencies in physiological time series, such as 
breathing and heart beat, or in the output of different 
EEG channels. The latter is particularly relevant for 
diseases characterized by abnormal synchronization, like 
epilepsy or Parkinson's disease. In the past, various mea- 
sures of interdependence have been used, including MI. 
But the latter was not employed extensively (see, how- 
ever, [13), mainly because of the supposed difficulty in 
estimating it reliably. We hope that the present estima- 
tors might change this situation. 
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Appendix 

We collect here some well known facts about MI, 
in particular for higher dimensions, and some imme- 
diate consequences. The first important property of 
I{X, Y) is its independence with respect to reparame- 
tizations. If X 1 = F(X) and Y' = G{Y) are homeo- 
morphisms (smooth and uniquely invertible maps), and 
J x = \\ax/dX'\\ and J Y = \\dY/dY'\\ are the Jacobi 
determinants, then 

M'(V,y') = J x (x')My'Mx,y) (44) 

and similarly for the marginal densities, which gives 



i(x'X) 



dx'dy'fi'(x',y') log 



/4(z'K(2/) 
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f f , , . . a(x, y) 

= / / dxdy fi{x, y) log — — — — 

= I(X,Y). (45) 

The next important property, checked also directly 
from the definitions, is 

I(X,Y,Z) = I((X,Y),Z)+I(X,Y) . (46) 

This is analogous to the additivity axiom for Shannon 
entropies [|, and says that MI can be decomposed into 
hierarchical levels. By iterating it, one can decompose 
I(Xi . . . X n ) for any n > 2 and for any partitioning of 
the set (Xi . . . X n ) into the MI between elements within 
one cluster and MI between clusters. 

Let us now consider a homeomorphism (X',Y') = 
F{X,Y). By combining Eas. (|4*5*)) and (|4*r)f we obtain 

I(X',Y',Z) = I{{X',Y'),Z)+I{X',Y') 

= I((X,Y),Z)+I(X',Y') (47) 
= I(X,Y,Z) + [I(X',Y')-I(X,Y)} . 

Thus, changes of high dimensional redundancies under 
reparametrization of some subspace can be obtained by 
calculating Mis in this subspace only. Although this is 
a simple consequence of well known facts about MI, it 
seems to have not been noticed before. It is numerically 
extremely useful, and would not hold in general for other 
interdependence measures. Again it generalizes to any 
dimension and to any number of random variables. 

It is well known that Gaussian distributions maximize 
the Shannon entropy for given first and second moments. 
This implies that the Shannon entropy of any distribu- 
tion is bounded from above by (l/2)logdetC where C 
is the covariance matrix. For MI one can prove a sim- 
ilar result: For any multivariate distribution with joint 
covariance matrix C and variances <Ji = Ca for the indi- 
vidual (scalar) random variables Xi, the redundancy is 
bounded from below, 

s 1 detC 
I{X 1 , . . . X m ) > - log . (48) 

l (71 . . . <7 m 



The r.h.s. of this inequality is just the redundancy of 
the corresponding Gaussian, and to prove Ea.l|18]l it we 
must show that the distribution minimizing the MI is 
Gaussian. 

In the following we sketch only the proof for the case 
of 2 variables X and Y, the generalization to m > 2 
being straightforward. We also assume without loss of 
generality that X and Y have zero mean. To prove 
Eq. (|48|) . we set up a minimization problem where the 
constraints (correct normalization and correct second 
moments; consistency relations fi x (x) = J dy [i(x,y) and 
Hy(y) = J dx /i(x,y)) are taken into account by means 
of Lagrangian multipliers. The "Lagrangian equation" 
6L/6fi(x, y) = leads then to 



K*, v) = z My) e— 2 - 6 * 2 — « (49) 

where Z, a, b, and c are constants fixed by the constraints. 
Since the minimal MI decreases when the variances a x = 
C xx and cjy = C yy increase with C xy fixed, the constants 
a and b are non- negative. Ea. H49|) is obviously consistent 
with fi(x, y) being a Gaussian. To prove uniqueness, we 
integrate Ea. (|49|l over y and put x — —iz/c, to obtain 

Z e~ az2 / c2 = J 'dy {n v (y)e~ bv2 } . (50) 



This shows that e~ by /i y (y) is the Fourier transform of 
a Gaussian, and thus fj, y (y) is also Gaussian. The same 
holds of course true for [i x (x) , showing that the minimiz- 
ing fi(x,y) must be Gaussian, QED. 

Finally, we should mention some possibly confusing no- 
tations. First, MI is often also called transinformation or 
redundancy. Secondly, what we call higher order redun- 
dancies are called higher order Mis in the ICA literature. 
We did not follow that usage in order to avoid confusion 
with cumulant-type higher order Mis |39| . 
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